Abstract. Protein interaction networks (PIN) are popular means to visualize the proteome. 
However, PIN datasets are known to be noisy, incomplete and biased by the experimental protocols 
used to detect protein interactions. This paper aims at understanding the connection between true 
protein interactions and the protein interaction datasets that have been obtained using the most 
popular experimental techniques, i.e. mass spectrometry (MS) and yeast two-hybrid (Y2H). We 
show that the most natural adjacency matrix of protein interaction networks has a separable form, 
and this induces precise relations between moments of the degree distribution and the number of 
short loops. These relations provide powerful tools to test the reliability of datasets and hint at 
the underlying biological mechanism with which proteins and complexes recruit each other. 


1. Introduction 

A protein interaction network (PIN) is a graph where nodes $i = 1 \ldots N$ represent proteins and 
links represent their interactions. This graph is encoded in an adjacency matrix $a = \{a_{ij}\}$, whose 
entries denote whether there is a link between proteins $i$ and $j$ ($a_{ij} = 1$) or not ($a_{ij} = 0$). However, 
there is ambiguity in its definition, arising from the non-binarity of the underlying biochemistry. 
For example, three proteins may form a complex, but may not interact in pairs. Assigning 
binary values to intrinsically non-binary interactions requires further prescriptions, which vary 
across experimental protocols and lead in practice to different graphs. Moreover, different 
experiments measure protein interactions in different ways, which causes further biases [...]. 
For quantitative studies of the effects of sampling biases on networks see e.g. [...]. 

In this paper we seek to establish the connection between true biological protein interactions 
and protein interaction datasets produced by the most popular experimental techniques, mass 
spectrometry (MS) and yeast two-hybrid (Y2H). We argue that the most natural network matrix 
representation of the proteome has a separable form, which induces precise relations between 
the degree distribution and the density of short loops. These relations provide simple tests 
to assess the reliability and quality of different data sets, and provide hints on the underlying 
(evolutionary) mechanisms with which proteins and complexes recruit each other. Our study 
also provides a theoretical framework to discriminate between ‘party’ and ‘date’ hubs in protein 
interaction networks, see e.g. [...] and references therein, and addresses several intriguing 
questions concerning the universality of protein and complex statistics across species. For 
example, given $N$ protein species in a cell, what is the number of complexes they typically 
form, i.e. to what extent is the ratio complexes/proteins conserved across different species? Is 
the distribution of complex sizes peaked around ‘typical’ values, or does it have long tails? How 
is this mirrored in the protein promiscuities, i.e. the propensities of proteins to participate in 
multiple complexes? Does the power law behaviour of the degree distribution of protein interaction 
networks perhaps result from tails in the distribution of complex sizes and protein promiscuities? 

Figure 1. Bipartite graph (or ‘factor graph’) representation of protein interactions. The protein 
species $i = 1 \ldots N$ are drawn as circles, and their complexes $\mu = 1 \ldots \alpha N$ as squares. We write the 
degree of protein $i$ as $d_i$ (the number of complexes it participates in), and the degree of complex 
$\mu$ as $q_\mu$ (the number of protein species it contains). The bipartite graph gives more detailed 
information than the conventional PIN with protein nodes and pairwise links only. For instance, 
one distinguishes easily between different types of ‘hub’ proteins: ‘date hub’ proteins connect to 
many degree-2 complexes, whereas ‘party hub’ proteins connect to a high degree complex. 

We tackle the above questions using an approach that is entirely based on statistical 
properties of graph ensembles. In section 2 we first define our models. Sections 3, 4 and 5 
are devoted to the derivation of properties of distinct separable graph ensembles which mimic 
protein interaction networks, each reflecting different possible mechanisms for complex genesis. 
In section 6 we test these properties in synthetically generated graphs, and in section 7 we do the 
same for protein interaction networks measured by MS and Y2H experiments. We end our paper 
with a summary of our conclusions, and suggest pathways for further research. 

2. Definitions and basic properties 

2.1. The bipartite graph representation of the proteome 

Proteins are large and complicated heteropolymers, which can bind in specific combinations to 
form stable molecular complexes. We consider a set of $N$ protein species, labelled by $i = 1 \ldots N$. 
We assume that the number of stable complexes $p$ scales as $p = \alpha N$ where $\alpha > 0$, and we label the 
complexes by $\mu = 1 \ldots \alpha N$. We can represent this system as a bipartite graph, see Figure 1, 
with two sets of nodes. The set $\nu_p$ represents proteins (drawn as circles), the set $\nu_c$ represents 
complexes (drawn as squares), and a link between protein $i \in \nu_p$ and complex $\mu \in \nu_c$ is drawn 
if protein $i$ participates in complex $\mu$. This graph is defined by the $N \times \alpha N$ connectivity matrix 
$\xi = \{\xi_i^\mu\}$, where $\xi_i^\mu = 1$ if there is a link between $i$ and $\mu$, and $\xi_i^\mu = 0$ otherwise. For simplicity 
we do not allow for complexes with more than one occurrence of any given protein species. 

In the bipartite graph one has two types of node degrees: the degree $d_i(\xi) = \sum_\mu \xi_i^\mu$ (or 
‘promiscuity’) of each protein $i$ gives the number of different complexes in which it is involved, and 
the degree $q_\mu(\xi) = \sum_i \xi_i^\mu$ (or ‘size’) of each complex $\mu$ gives the number of protein species of which 
it is formed. We define the distribution of promiscuities in graph $\xi$ as $p(d|\xi) = N^{-1} \sum_i \delta_{d,d_i(\xi)}$, 
with the average promiscuity $\langle d(\xi)\rangle = \sum_d d\, p(d|\xi)$, and the distribution of complex sizes as 
$p(q|\xi) = (\alpha N)^{-1} \sum_{\mu=1}^{\alpha N} \delta_{q,q_\mu(\xi)}$, with the average complex size $\langle q(\xi)\rangle = \sum_q q\, p(q|\xi)$. Since the 
number of links is conserved, we always have $\langle d(\xi)\rangle = \alpha\langle q(\xi)\rangle$ for any bipartite graph $\xi$. 
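
The degree definitions and the link-conservation identity can be checked on a small example. The following is a minimal sketch in Python with NumPy; the function name `bipartite_degrees` and the toy matrix are illustrative assumptions, not part of the paper.

```python
import numpy as np

def bipartite_degrees(xi):
    """Return (d, q): protein promiscuities and complex sizes of a 0/1 matrix xi."""
    xi = np.asarray(xi)
    d = xi.sum(axis=1)   # d_i = sum over complexes mu of xi[i, mu]
    q = xi.sum(axis=0)   # q_mu = sum over proteins i of xi[i, mu]
    return d, q

# Hypothetical toy system: N = 4 proteins, alpha*N = 2 complexes (alpha = 0.5).
xi = np.array([[1, 0],
               [1, 1],
               [0, 1],
               [1, 1]])
d, q = bipartite_degrees(xi)
alpha = xi.shape[1] / xi.shape[0]
mean_d, mean_q = d.mean(), q.mean()
print(mean_d, alpha * mean_q)   # both sides count links per protein
```

Both printed numbers equal the number of links divided by $N$, which is why the identity holds for every bipartite graph and not only on average.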


2.2. Link distribution in the bipartite graph 


Since we generally do not know the microscopic bipartite graph $\xi$, we will regard it as a quenched 
random object. Several natural choices can be proposed for its distribution $p(\xi)$. If we assume 
that complexes recruit proteins, independently and with the same likelihood, we are led to 

$$p_A(\xi) = \prod_{i\mu} \left[ \frac{q_\mu}{N}\, \delta_{\xi_i^\mu,1} + \left(1 - \frac{q_\mu}{N}\right) \delta_{\xi_i^\mu,0} \right] \tag{1}$$

with $\delta_{xy} = 1$ for $x = y$ and 0 otherwise, and where the $\{q_\mu\}$ are distributed according to 
$P(q) = (\alpha N)^{-1} \sum_\mu \delta_{q,q_\mu}$. For graphs $\xi$ drawn from the ensemble (1) and $N \to \infty$, each complex 
size $q_\mu(\xi)$ is a Poissonian random variable with average $q_\mu$, and all protein promiscuities $d_i(\xi)$ are 
Poissonian variables with average $\langle d\rangle = \alpha\langle q\rangle$, since 

$$p(d) = \lim_{N\to\infty} \big\langle \delta_{d, \sum_\mu \xi_i^\mu} \big\rangle = e^{-\alpha\langle q\rangle}\, \frac{(\alpha\langle q\rangle)^d}{d!} \tag{2}$$


In the scenario (1) complexes have sizes that are determined e.g. by their functions, and this 
controls the promiscuities of the recruited proteins. Alternatively one could assume that the 
likelihood of a protein participating in a complex is driven by its promiscuity, leading to the 
‘dual’ ensemble 

$$p_B(\xi) = \prod_{i\mu} \left[ \frac{d_i}{\alpha N}\, \delta_{\xi_i^\mu,1} + \left(1 - \frac{d_i}{\alpha N}\right) \delta_{\xi_i^\mu,0} \right] \tag{3}$$


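where the $\{d_i\}$ are distributed according to $P(d) = N^{-1} \sum_i \delta_{d,d_i}$. Here as $N \to \infty$ the protein 
promiscuities $d_i(\xi)$ are Poissonian variables with averages $d_i$, whereas all complex sizes $q_\mu(\xi)$ are 
Poisson variables with identical average $\langle q\rangle = \langle d\rangle/\alpha$, since 

$$p(q) = \lim_{N\to\infty} \big\langle \delta_{q, \sum_i \xi_i^\mu} \big\rangle = \lim_{N\to\infty} \int_{-\pi}^{\pi} \frac{d\omega}{2\pi}\, e^{i\omega q} \big\langle e^{-i\omega \sum_i \xi_i^\mu} \big\rangle = \int_{-\pi}^{\pi} \frac{d\omega}{2\pi}\, e^{i\omega q + \frac{\langle d\rangle}{\alpha}(e^{-i\omega} - 1)} = e^{-\langle d\rangle/\alpha}\, \frac{(\langle d\rangle/\alpha)^q}{q!} \tag{4}$$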

In this second ensemble proteins have intrinsic promiscuities, determined e.g. by the number of 
their binding sites, their polarization and so on, and these drive their recruitment to complexes. 
A third obvious choice is the ‘mixed’ ensemble 

$$p_C(\xi) = \prod_{i\mu} \left[ \frac{d_i q_\mu}{\alpha N \langle q\rangle}\, \delta_{\xi_i^\mu,1} + \left(1 - \frac{d_i q_\mu}{\alpha N \langle q\rangle}\right) \delta_{\xi_i^\mu,0} \right] \tag{5}$$

where all protein promiscuities and complex sizes are constrained on average, i.e. $\langle d_i(\xi)\rangle = d_i$ 
and $\langle q_\mu(\xi)\rangle = q_\mu$, with $\{d_i\}$ and $\{q_\mu\}$ distributed according to $P(d)$ and $P(q)$. Here protein 
binding statistics are driven both by complex functionality and protein promiscuity factors. The 
mixed ensemble (5) reduces to (1) for the choice $P(d) = \delta_{d,\alpha\langle q\rangle}$, and to (3) when $P(q) = \delta_{q,\langle q\rangle}$. 
By determining which of the above ensembles better reflects biological reality, we will thus learn 
about the mechanisms with which complexes and proteins recruit each other. 

The above three ensembles become equivalent when $q_\mu = \langle q\rangle\ \forall \mu$ and $d_i = \alpha\langle q\rangle\ \forall i$. In 
that case complex sizes and protein promiscuities are homogeneous, and the recruitment process 
between proteins and complexes is fully random. Bipartite graphs drawn from (1) were found 
to have modular topologies, and to accomplish parallel information processing for suitable values 
of the parameter $\alpha$ [...]. Their ensemble entropy has been calculated in [15]. One can show 
easily that if one replaces the soft constraints on the local degrees in our soft-constrained graph 
ensembles (1) and (3) by hard constraints, then one finds asymptotically the same distributions (2) and (4). 
Finally, we note that all three ensembles (1), (3) and (5) are of the form $p(\xi) = \prod_{i\mu} p_{i\mu}(\xi_i^\mu)$, so there are 
no correlations between the entries of $\xi$. This strong assumption of our models will need to be 
checked a posteriori. 
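
Because all three ensembles factorise over the entries of $\xi$, a graph can be drawn with one independent Bernoulli trial per (protein, complex) pair. The sketch below is a hedged illustration in Python with NumPy, not the authors' code: the function name `sample_bipartite`, the clipping of probabilities and all parameter values are assumptions. It uses the mixed link probability $d_i q_\mu / (\alpha N \langle q\rangle)$, which reduces to $q_\mu/N$ for $d_i = \alpha\langle q\rangle$ and to $d_i/(\alpha N)$ for $q_\mu = \langle q\rangle$.

```python
import numpy as np

def sample_bipartite(d, q, rng):
    """Draw xi with independent entries, P(xi[i, mu] = 1) = d_i q_mu / (alpha N <q>)."""
    d = np.asarray(d, dtype=float)
    q = np.asarray(q, dtype=float)
    # alpha * N * <q> is simply the sum of all target complex sizes.
    prob = np.outer(d, q) / q.sum()
    prob = np.clip(prob, 0.0, 1.0)   # assumption: guard against values above one
    return (rng.random(prob.shape) < prob).astype(np.int8)

rng = np.random.default_rng(0)
N, alpha = 2000, 0.5
M = int(alpha * N)                     # number of complexes
q = rng.choice([2, 3, 4], size=M)      # hypothetical target complex sizes
d_hom = np.full(N, q.sum() / N)        # d_i = alpha <q>: complex-driven case
xi = sample_bipartite(d_hom, q, rng)
print(xi.shape, xi.sum(axis=0).mean(), q.mean())
```

The measured mean complex size should be close to the target mean, with fluctuations that shrink as $N$ grows.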


2.3. Accounting for binding sites 

In all PINs each protein is reduced to a simple network node, in spite of the fact that proteins 
are in reality complex chains of amino acids with several binding domains. Here we show that the 
ensembles introduced in the previous section can accommodate the presence of multiple binding 
sites when these are equally reactive. Let us first assume that each protein has $d$ functional reactive 
amino-acid endgroups. When two such proteins bind, the resulting dimer has $2d - 2$ unused 
reactive endgroups, a trimer has $3d - 4$ endgroups, and a $k$-mer has $kd - 2(k - 1) = (d - 2)k + 2$ 
endgroups. If all endgroups are equally reactive, the a priori probability that a protein $i$ is part 
of a complex $\mu$ is given by 

$$p(\xi_i^\mu = 1) = \frac{d\,[(d - 2)q_\mu + 2]}{Z} \simeq \frac{q_\mu d}{\alpha N \langle q\rangle} \tag{6}$$

where the last approximate equality holds for $d \gg 1$ and $Z = \sum_\mu q_\mu d = \alpha N \langle q\rangle d$. This corresponds 
to ensemble (1), with the choice $d = \alpha\langle q\rangle$. If proteins have different endgroups $d_i$, 

$$p(\xi_i^\mu = 1) \simeq \frac{d_i\,[(\bar{d} - 2)q_\mu + 2]}{\alpha N \langle q\rangle \bar{d}} \simeq \frac{d_i q_\mu}{\alpha N \langle q\rangle} \tag{7}$$

where $\bar{d} = N^{-1} \sum_i d_i$, leading to ensemble (5). If the variability of $q_\mu$ is small, $q_\mu \simeq \langle q\rangle$, 

$$p(\xi_i^\mu = 1) = \frac{d_i}{\alpha N} \tag{8}$$

and we retrieve (3). The assumption of unbiased interactions between proteins with varying 
individual binding affinities has been supported in [...]. 
Protein detection experiments seek to measure for each pair $(i,j)$ of protein species whether they 
interact in any complex, and assign an undirected link between nodes $i$ and $j$ if they do. Hence 
the PIN adjacency matrix $a = \{a_{ij}\}$ resulting from such experiments can be expressed in terms 
of the entries of the bipartite graph $\xi$ in Figure 1 via 

$$a_{ij} = \theta\Big( \sum_{\mu=1}^{\alpha N} \xi_i^\mu \xi_j^\mu \Big) \quad \forall\, i \neq j \tag{9}$$

and $a_{ii} = 0\ \forall i$, with the convention $\theta(0) = 0$ for the step function, defined by $\theta(x > 0) = 1$ 
and $\theta(x < 0) = 0$. The aim of this paper hence translates into studying the properties of the 
following ensemble of nondirected random graphs, in which the $\{\xi_i^\mu\}$ are drawn from either of the 
ensembles (1), (3) or (5): 

$$p(a) = \Big\langle \Big[ \prod_{i<j} \delta_{a_{ij},\, \theta(\sum_{\mu \leq \alpha N} \xi_i^\mu \xi_j^\mu)} \Big] \Big[ \prod_i \delta_{a_{ii},0} \Big] \Big\rangle_\xi \tag{10}$$


Some properties of (10) will turn out not to depend on the choices made for the distributions of 
complex sizes and protein promiscuities, and this leads to powerful benchmarks against which to 
test available PIN datasets. A key feature we exploit in our analysis is that averages over (10) 
can often be replaced by averages over the following related ensemble of weighted graphs 

$$p(c) = \Big\langle \Big[ \prod_{i<j} \delta_{c_{ij},\, \sum_{\mu \leq \alpha N} \xi_i^\mu \xi_j^\mu} \Big] \Big[ \prod_i \delta_{c_{ii},0} \Big] \Big\rangle_\xi \tag{11}$$

Here an entry $c_{ij} = \sum_{\mu \leq \alpha N} \xi_i^\mu \xi_j^\mu \in \mathbb{N}$ represents the number of complexes in which proteins $i$ and 
$j$ participate simultaneously. For finite $d_i$ and $\alpha$, one finds that in large networks generated via 
(1), (3) or (5) the probability of seeing $c_{ij} > 1$ is of order $O(N^{-2})$, and the values of many macroscopic 
observables in the $a$ and $c$ ensembles will, to leading order in $N$, be identical. 
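
The two graph definitions differ only in whether shared complexes are counted or merely detected. As a minimal sketch (Python with NumPy; `pin_from_bipartite` and the toy matrix are hypothetical names and data, not taken from the paper), both matrices follow from one matrix product:

```python
import numpy as np

def pin_from_bipartite(xi):
    """Return (c, a): weighted and binary PIN matrices built from a 0/1 matrix xi."""
    xi = np.asarray(xi, dtype=np.int64)
    c = xi @ xi.T                  # c[i, j] = number of complexes shared by i and j
    np.fill_diagonal(c, 0)         # no self-interactions: c_ii = 0
    a = (c > 0).astype(np.int64)   # step function with the convention theta(0) = 0
    return c, a

# Hypothetical toy system: 4 proteins, 2 complexes.
xi = np.array([[1, 0],
               [1, 1],
               [0, 1],
               [1, 1]])
c, a = pin_from_bipartite(xi)
print(c)
print(a)
```

In this toy case proteins 1 and 3 share both complexes, so the weighted entry is 2 while the binary entry is 1; in large sparse graphs such double links are rare, which is the content of the $O(N^{-2})$ statement.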


3. Network properties generated by the q-ensemble 

In this section we study the statistical properties of the ensembles (11) and (10) upon generating 
the bipartite protein interaction graph $\xi$ from ensemble (1), where complexes recruit proteins. 


3.1. Link probabilities 

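For the graphs $c$ of (11) we find the following expectation values of individual bonds 

$$\langle c_{ij}\rangle = \sum_{\mu=1}^{\alpha N} \langle \xi_i^\mu \xi_j^\mu \rangle = \sum_{\mu=1}^{\alpha N} \Big( \frac{q_\mu}{N} \Big)^2 = \frac{\alpha}{N} \langle q^2\rangle \tag{12}$$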


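where the brackets on the right-hand side denote averaging over the complex size distribution 
$P(q)$. The likelihood of an individual bond is (see Appendix A) 

$$p(c_{ij}) = \delta_{c_{ij},0} + \frac{\alpha\langle q^2\rangle}{N} \big( \delta_{c_{ij},1} - \delta_{c_{ij},0} \big) + \Big[ \frac{\alpha^2 \langle q^2\rangle^2}{2N^2} - \frac{\alpha\langle q^4\rangle}{2N^3} \Big] \big( \delta_{c_{ij},2} - 2\delta_{c_{ij},1} + \delta_{c_{ij},0} \big) + \frac{\alpha^3 \langle q^2\rangle^3}{6N^3} \big( \delta_{c_{ij},3} - 3\delta_{c_{ij},2} + 3\delta_{c_{ij},1} - \delta_{c_{ij},0} \big) + O(N^{-4}) \tag{13}$$

Protein interaction networks and biology: towards the connection 

so we find for the first few probabilities: 

$$p(0) = 1 - \frac{\alpha\langle q^2\rangle}{N} + \frac{\alpha^2\langle q^2\rangle^2}{2N^2} - \frac{\alpha\langle q^4\rangle}{2N^3} - \frac{\alpha^3\langle q^2\rangle^3}{6N^3} + O(N^{-4}) \tag{14}$$

$$p(1) = \frac{\alpha\langle q^2\rangle}{N} - \frac{\alpha^2\langle q^2\rangle^2}{N^2} + \frac{\alpha\langle q^4\rangle}{N^3} + \frac{\alpha^3\langle q^2\rangle^3}{2N^3} + O(N^{-4}) \tag{15}$$

and hence 

$$\sum_{\ell>1} p(\ell) = 1 - p(0) - p(1) = O(N^{-2}), \qquad \sum_{\ell>1} \ell\, p(\ell) = \langle c_{ij}\rangle - p(1) = O(N^{-2}) \tag{16}$$

The probability to have $c_{ij} \neq 0$ is of order $O(N^{-1})$, so the graphs generated by (11) are finitely 
connected. Moreover, although the graphs $c$ are in principle weighted, for large $N$ the number of 
links per node that are not in $\{0,1\}$ will be vanishingly small. 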
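
The large-$N$ expansion of the bond probabilities can be compared with the exact product form, since $c_{ij}$ is a sum of independent Bernoulli variables with success probabilities $(q_\mu/N)^2$. The following sketch (Python with NumPy; function names and parameter values are illustrative assumptions) evaluates both for a small homogeneous example.

```python
import numpy as np

def link_probabilities_exact(q, N):
    """Exact P(c_ij = 0) and P(c_ij = 1) for independent complexes, p_mu = (q_mu/N)^2."""
    p = (np.asarray(q, dtype=float) / N) ** 2
    p0 = np.prod(1.0 - p)
    p1 = p0 * np.sum(p / (1.0 - p))   # exactly one complex contains both i and j
    return p0, p1

def link_probabilities_expanded(q, N):
    """Large-N expansion of P(c_ij = 0) and P(c_ij = 1), dropping O(N^-4) terms."""
    q = np.asarray(q, dtype=float)
    alpha = q.size / N
    A = alpha * np.mean(q**2) / N            # alpha <q^2> / N
    B = alpha * np.mean(q**4) / (2 * N**3)   # alpha <q^4> / (2 N^3)
    p0 = 1 - A + A**2 / 2 - B - A**3 / 6
    p1 = A - A**2 + 2 * B + A**3 / 2
    return p0, p1

q = np.full(50, 3)   # hypothetical: 50 complexes of target size 3
N = 100              # so alpha = 0.5
exact = link_probabilities_exact(q, N)
approx = link_probabilities_expanded(q, N)
print(exact, approx)
```

For these values the two pairs should agree up to differences of the order of the neglected $O(N^{-4})$ terms.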


3.2. Densities of short loops 

We now turn to the calculation of expectation values for different observables in ensemble (11). 
First, we calculate the average number of ordered and oriented loops of length 3 per node, which 
is (see Appendix A): 

$$m_3 = \frac{1}{N} \sum_{ijk} \langle c_{ij} c_{jk} c_{ki} \rangle = \frac{1}{N} \sum_{[ijk]} \sum_{\mu\nu\rho=1}^{\alpha N} \langle \xi_i^\mu \xi_j^\mu \xi_j^\nu \xi_k^\nu \xi_k^\rho \xi_i^\rho \rangle \tag{17}$$

$$\phantom{m_3} = \alpha\langle q^3\rangle + O(N^{-1}) \tag{18}$$

Calculating the density of loops $m_L$ for lengths $L > 3$ can be simplified by returning to the 
bipartite graph $\xi$. We define a star $S_n$ to be a simple $(n+1)$-node tree in $\xi$, of which the central 
node belongs to $\nu_c$ (the complexes), and the $n$ leaves belong to $\nu_p$ (the proteins). Thus $S_2$ stars 
represent protein dimers, $S_3$ stars represent protein trimers, and so on. Each link in $c$ corresponds 
to at least one $S_2$ star in the bipartite graph (which, in turn, can be a subset of any $S_n$ star with 
$n > 2$). Therefore, the total number of $S_2$ stars in the bipartite graph, 

$$\sum_\mu \sum_{i \neq j} \langle \xi_i^\mu \xi_j^\mu \rangle = \sum_\mu \sum_{i \neq j} \langle \xi_i^\mu \rangle \langle \xi_j^\mu \rangle = \sum_{i \neq j} \sum_\mu \frac{q_\mu^2}{N^2} = \alpha (N - 1) \langle q^2\rangle \tag{19}$$

has to equate in leading order the total number of links $N\langle k\rangle$ in graph $c$, yielding 

$$\langle q^2\rangle = \frac{\langle k\rangle}{\alpha} + O(N^{-1}) \tag{20}$$

which is indeed in agreement with the result of the direct calculation $\langle k\rangle = N^{-1} \sum_{ij} \langle c_{ij}\rangle$, using 
(12). Similarly we can obtain the number of loops of length 3, calculated earlier, by realising that 
these loops arise when we have in the bipartite graph either a star $S_3$ (which can be a subset of 
any $S_n$ with $n > 3$) or a combination of three $S_2$ stars, where every leaf is shared by two stars. 
The contribution of the number of $S_3$ stars per node to the number of loops of length 3 is 

$$\frac{1}{N} \sum_\mu \sum_{[i,j,k]} \langle \xi_i^\mu \xi_j^\mu \xi_k^\mu \rangle = \frac{1}{N} \sum_\mu \sum_{[i,j,k]} \frac{q_\mu^3}{N^3} = \alpha\langle q^3\rangle + O(N^{-1}) \tag{21}$$
The contribution of the combination of three $S_2$ stars, where each leaf is shared by two stars, is 

$$\frac{1}{N} \sum_{[\mu,\nu,\rho]} \sum_{[i,j,k]} \langle \xi_i^\mu \xi_j^\mu \xi_j^\nu \xi_k^\nu \xi_k^\rho \xi_i^\rho \rangle = \frac{1}{N} \sum_{[\mu,\nu,\rho]} \sum_{[i,j,k]} \frac{q_\mu^2 q_\nu^2 q_\rho^2}{N^6} = \frac{\alpha^3}{N} \langle q^2\rangle^3 + O(N^{-2}) \tag{22}$$

with the square brackets $[i,j,k]$ denoting that the three indices are distinct. The expected density 
of length-3 loops is the sum of an $O(1)$ contribution from $S_3$ stars, plus an $O(N^{-1})$ contribution 
from combinations of three $S_2$ stars that share leaves. For large $N$ the second contribution 
vanishes, and we recover $m_3 = \alpha\langle q^3\rangle$. Likewise, the $O(1)$ contribution to the density of length-4 
loops comes from $S_4$ stars in the bipartite graph, which consist of five sites (four leaves and one 
central node) and four links, each with probability $O(N^{-1})$. Combinations of two $S_3$ stars with 
two shared leaves, or of $S_2$ stars, always involve a number of links at least equal to the number 
of nodes and therefore yield sub-leading contributions. Hence, the density of loops of length 4 is 

$$m_4 = \frac{1}{N} \sum_\mu \sum_{[i,j,k,\ell]} \langle \xi_i^\mu \xi_j^\mu \xi_k^\mu \xi_\ell^\mu \rangle = \alpha\langle q^4\rangle + O(N^{-1}) \tag{23}$$

More generally, the average density of loops of arbitrary length $L$ is given by 

$$m_L = \alpha\langle q^L\rangle + O(N^{-1}) \tag{24}$$

For large $N$ the ratio $\alpha$ and the distribution $P(q)$ of complex sizes apparently determine in full 
the statistics of loops of arbitrary length in $c$, if the protein interactions are described by (1). 

Finally, we note that if $m_L$ gives the number of ordered and oriented loops of length $L$ per 
node, the number of unordered and unoriented closed paths of length $L$ equals $\tilde{m}_L = m_L/(2L)$, since 
there are $L$ possible nodes to start a closed path from, and two possible orientations. 
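
For a measured binary adjacency matrix $a$, the densities of ordered and oriented loops of length 3 and 4 can be obtained from matrix powers. The sketch below (Python with NumPy; the function name `loop_densities` is an assumption) uses $\mathrm{tr}(a^3)$ for triangles and, for length 4, subtracts the closed walks that revisit a node, so that only loops over four distinct nodes remain. It applies to simple graphs only, i.e. symmetric 0/1 matrices with zero diagonal.

```python
import numpy as np

def loop_densities(a):
    """Return (m3, m4): ordered, oriented loops of length 3 and 4 per node in a simple graph."""
    a = np.asarray(a, dtype=float)
    N = a.shape[0]
    k = a.sum(axis=1)                # node degrees
    a2 = a @ a
    m3 = np.sum(a2 * a) / N          # trace(a^3) / N, using the symmetry of a
    closed4 = np.sum(a2 * a2)        # trace(a^4)
    # Remove closed walks i-j-k-l-i with i = k or j = l (backtracking).
    m4 = (closed4 - 2 * np.sum(k**2) + np.sum(k)) / N
    return m3, m4

a = np.ones((4, 4)) - np.eye(4)      # complete graph on four nodes
m3, m4 = loop_densities(a)
print(m3, m4)
```

The complete graph on four nodes has 4 triangles and 3 four-cycles; multiplying by $2L$ orderings and dividing by $N = 4$ gives 6 for both densities.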


3.3. The degree distribution 


It follows from (20) and (24) that by measuring the average degree $\langle k\rangle$ and the densities $m_L$ of loops 
of length $L$ we can compute all the moments of the distribution of complex sizes $P(q)$: 

$$\langle q^2\rangle = \langle k\rangle/\alpha, \qquad \forall L > 2:\ \langle q^L\rangle = m_L/\alpha \tag{25}$$

This would allow us to calculate $P(q)$ in full via its generating function, provided $\alpha$ and $\langle q\rangle$ are 
known. However, counting the number of loops of arbitrary length in a graph is computationally 
challenging, and $\alpha$ and $\langle q\rangle$ are generally unknown. Fortunately, it is possible to express $P(q)$ for 
large $N$ in terms of the degree distribution $p(k)$ of $c$. Specifically, in Appendix B we show that 

$$\lim_{N\to\infty} p(k) = \int_0^\infty dy\, P(y)\, e^{-y}\, \frac{y^k}{k!} \tag{26}$$

with 

$$P(y) = \sum_{\ell \geq 0} e^{-\alpha\langle q\rangle} \frac{(\alpha\langle q\rangle)^\ell}{\ell!} \sum_{q_1 \ldots q_\ell} \Big[ \prod_{r \leq \ell} W(q_r) \Big]\, \delta\Big( y - \sum_{r \leq \ell} q_r \Big) \tag{27}$$

and $W(q) = qP(q)/\langle q\rangle$ is the likelihood to draw a link attached to a complex-node of degree $q$ in 
the bipartite graph $\xi$. Formula (26) is easily interpreted. The degree of node $i$ in $c$ is given by the 
number of second neighbours of $i$ in $\xi$. The number $\ell$ of first neighbours of node $i$ will thus be a Poissonian 
variable with average $\alpha\langle q\rangle$, and each of its $\ell$ first neighbours will have a degree $q_r$ drawn from 
$W(q_r)$. Clearly, any tail in the distribution $W(q)$ will induce a tail in the distribution $p(k)$, with 
(as we will show below) the same exponent, but an amplitude that is reduced by a factor $\alpha\langle q\rangle$. 
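
The interpretation of the degree as a count of second neighbours translates directly into a sampling recipe: draw a Poissonian number of complexes, draw each complex size from the size-biased distribution $W(q)$, and finally draw $k$ as a Poissonian variable whose mean is the sum of those sizes. The sketch below is an illustration under these stated assumptions (Python with NumPy; the function name, the three-point size distribution and the sample size are hypothetical).

```python
import numpy as np

def sample_degrees(sizes, probs, alpha, n_samples, rng):
    """Sample PIN degrees k from the compound-Poisson picture of the complex-driven ensemble."""
    sizes = np.asarray(sizes, dtype=float)
    probs = np.asarray(probs, dtype=float)
    mean_q = np.sum(sizes * probs)
    W = sizes * probs / mean_q                          # size-biased distribution W(q)
    ell = rng.poisson(alpha * mean_q, size=n_samples)   # complexes per protein
    draws = rng.choice(sizes, size=ell.sum(), p=W)      # sizes q_r of those complexes
    owner = np.repeat(np.arange(n_samples), ell)        # protein index of each draw
    y = np.bincount(owner, weights=draws, minlength=n_samples)
    return rng.poisson(y)                               # k is Poissonian with mean y

rng = np.random.default_rng(1)
alpha = 0.5
sizes, probs = [2, 3, 4], [1 / 3, 1 / 3, 1 / 3]
k = sample_degrees(sizes, probs, alpha, 200_000, rng)
mean_q2 = np.mean(np.square(sizes))                     # <q^2> for the uniform choice
mean_q3 = np.mean(np.power(sizes, 3))                   # <q^3>
print(k.mean(), alpha * mean_q2)
print(k.var(), alpha * mean_q2 + alpha * mean_q3)
```

The sample mean should approach $\alpha\langle q^2\rangle$ and the sample variance $\alpha\langle q^2\rangle + \alpha\langle q^3\rangle$, which are the first two cumulants implied by the compound-Poisson form.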

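One can complement (26) with a reciprocal relation that gives $P(q)$ in terms of $p(k)$. To 
achieve this we define the generating functions $Q_1(z) = \sum_k p(k) e^{-kz}$, $Q_2(z) = \int_0^\infty dy\, P(y) e^{-yz}$ and 
$Q_3(z) = \sum_q W(q) e^{-zq}$. We then see from expression (26) for $p(k)$ that 

$$Q_1(z) = \int_0^\infty dy\, P(y)\, e^{-y} \sum_k \frac{y^k e^{-kz}}{k!} = \int_0^\infty dy\, P(y)\, e^{y[e^{-z} - 1]} = Q_2(1 - e^{-z}) \tag{28}$$

$$Q_2(z) = \sum_{\ell \geq 0} e^{-\alpha\langle q\rangle} \frac{(\alpha\langle q\rangle)^\ell}{\ell!} \Big[ \sum_q W(q) e^{-zq} \Big]^\ell = e^{\alpha\langle q\rangle [Q_3(z) - 1]} \tag{29}$$

The first identity can be rewritten as $Q_1(-\log(1 - y)) = Q_2(y)$. Inserting this into (29) allows 
us to express the desired $Q_3(z)$ as 

$$Q_3(z) = 1 + \frac{\log Q_2(z)}{\alpha\langle q\rangle} = 1 + \frac{\log Q_1(-\log(1 - z))}{\alpha\langle q\rangle} \tag{30}$$

which translates into 

$$\sum_{q > 0} P(q)\, q\, e^{-zq} = \langle q\rangle + \frac{1}{\alpha} \log \sum_k p(k) (1 - z)^k \tag{31}$$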

We can now extract the asymptotic form of $P(q)$ from that of $p(k)$. The generating functions 
$Q_1(z)$ of degree distributions that exhibit prominent tails, i.e. $p(k) \sim C k^{-\mu}$ for large $k$ with 
$2 < \mu < 3$ (as observed in protein interaction networks [...]), are for small $z$ of the form 

$$Q_1(z) = 1 - \langle k\rangle z + C\,\Gamma(1 - \mu)\, z^{\mu - 1} + \ldots \tag{32}$$

where $\Gamma$ is Euler’s gamma function [...]. For small $z$ we may use $1 - z \simeq e^{-z}$ to rewrite (30) as 

$$\log Q_1(z) \simeq \alpha\langle q\rangle [Q_3(z) - 1] \tag{33}$$

Combining this with (32) then gives, for small $z$, 

$$-\langle k\rangle z + C\,\Gamma(1 - \mu)\, z^{\mu - 1} \simeq \alpha\langle q\rangle [Q_3(z) - 1] \tag{34}$$

Hence, for small $z$, $Q_3(z)$ has the same form as $Q_1(z)$, 

$$Q_3(z) = 1 - \frac{\langle k\rangle}{\alpha\langle q\rangle}\, z + \frac{C}{\alpha\langle q\rangle}\, \Gamma(1 - \mu)\, z^{\mu - 1} \tag{35}$$

Therefore $W(q)$ behaves asymptotically in the same way as $p(k)$, i.e. $W(q) \sim (C/\alpha\langle q\rangle)\, q^{-\mu}$. This, 
in turn, gives 

$$P(q) \sim (C/\alpha)\, q^{-\mu - 1} \tag{36}$$

The complex size distribution $P(q)$ in (1) decays faster than the degree distribution of the 
associated $c$, so fat tails in the degree distribution of protein interaction networks can emerge 
from less heterogeneous complex size distributions. In particular, complex size distributions with 
a finite second moment (but diverging higher moments) give scale-free degree distributions in $c$. 
This is consistent with the intuition that, while large hubs are often observed in protein interaction 
networks, super-complexes of the same number of proteins are unlikely to be stable. Indeed, many 
interactions in hubs are ‘date’ type, as opposed to ‘party’ type [...]. Our framework allows us 
to discriminate between different types of hub proteins, and suggests that heterogeneities in PINs 
may emerge from homogeneous protein ‘dating’ and moderately heterogeneous protein ‘partying’. 

Figure 2. Symbols: theoretical $(\ldots)_{\rm th}$ versus measured $(\ldots)_{\rm m}$ values of observables $\langle k\rangle$, $\langle k^2\rangle$, 
$m_3$ and $m_4$ in synthetically generated random graphs $c$ with $N = 3000$, defined via (1) and (11) for a power-law 
distributed complex size distribution $P(q)$. Theoretical values are given by formulae (37) for $\langle k\rangle$, 
(38) for $\langle k^2\rangle$, (24) and (40) for $m_3$ and (24) and (41) for $m_4$. Dotted lines: the diagonals (as 
guides to the eye). 


3.4. Relations that are independent of $P(q)$ and $\alpha$ 

The first two moments of $p(k)$ are given, to leading order in $N$, by (see Appendix B) 

$$\langle k\rangle = \alpha\langle q^2\rangle + O(N^{-1}) \tag{37}$$

which is in agreement with (20), and 

$$\langle k^2\rangle = \alpha\langle q^2\rangle + \alpha\langle q^3\rangle + \alpha^2 \langle q^2\rangle^2 \tag{38}$$

The latter is easily interpreted in terms of the underlying bipartite graph: $\langle k^2\rangle$ is the average 
density of paths of length two, so it has a contribution from $\langle k\rangle = \alpha\langle q^2\rangle$ due to backtracking, 
plus a contribution from pairs of $S_2$ stars that share a node, whose density is 

$$\frac{1}{N} \sum_{[ijk]} \sum_{\mu \neq \nu} \langle \xi_i^\mu \xi_j^\mu \xi_j^\nu \xi_k^\nu \rangle = \frac{1}{N} \sum_{[ijk]} \sum_{\mu \neq \nu} \frac{q_\mu^2 q_\nu^2}{N^4} = \alpha^2 \langle q^2\rangle^2, \tag{39}$$

plus a contribution from $S_3$ stars, whose density is $\alpha\langle q^3\rangle$ (as shown earlier). Combining (38) with 
(25) gives us a relation between average and width of the degree distribution of $c$ and its density 
of length-3 loops. Remarkably, this relation is completely independent of $\alpha$ and $P(q)$: 

$$m_3 = \langle k^2\rangle - \langle k\rangle^2 - \langle k\rangle \tag{40}$$

This identity and others, which all depend only on the separable underlying nature of the PIN 
and the assumption of complex-driven recruitment of proteins to complexes, can be derived more 
systematically from (31) by expanding both sides as power series in $z$ and comparing the expansion 
coefficients. This gives a hierarchy of relations between moments of $p(k)$ and $P(q)$, and hence (via 
(24)) between moments of $p(k)$ and densities of loops of increasing length, that are all completely 
independent of $\alpha$ and $P(q)$. At order $z^2$ one recovers (40). The next order $z^3$ leads to 

$$m_4 = \langle k^3\rangle - 3\langle k^2\rangle + 2\langle k\rangle - \langle k\rangle \big( 3\langle k^2\rangle - 3\langle k\rangle - 2\langle k\rangle^2 \big) = \langle k^3\rangle - 3\langle k^2\rangle + 2\langle k\rangle - \langle k\rangle^3 - 3\langle k\rangle m_3 \tag{41}$$

To test these asymptotic identities in finite systems, we generated random graphs $c$ of size $N = 3000$ 
according to (1) and (11), and compared the measured values of $m_3$ and $m_4$ in these random graphs 
with the predictions of formulae (40) and (41), respectively. We show the results in Figure 2. 
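
A test of this kind can be sketched in a few lines. The example below is not the authors' simulation: it assumes a smaller system ($N = 2000$), complex sizes drawn uniformly from $\{2, 3, 4\}$ rather than from a power law, and hypothetical function and variable names. It builds a binary PIN from a complex-driven bipartite graph and compares the measured density of length-3 loops with the degree-based combination $\langle k^2\rangle - \langle k\rangle^2 - \langle k\rangle$ and with $\alpha\langle q^3\rangle$.

```python
import numpy as np

def q_ensemble_pin(q, N, rng):
    """Binary PIN a from a bipartite graph with P(xi[i, mu] = 1) = q_mu / N."""
    q = np.asarray(q, dtype=float)
    xi = (rng.random((N, q.size)) < q / N).astype(float)
    c = xi @ xi.T                    # number of shared complexes per protein pair
    np.fill_diagonal(c, 0.0)
    return (c > 0).astype(float)

rng = np.random.default_rng(2)
N, alpha = 2000, 0.5
q = rng.choice([2, 3, 4], size=int(alpha * N))   # assumed complex sizes

a = q_ensemble_pin(q, N, rng)
k = a.sum(axis=1)
m3_measured = np.sum((a @ a) * a) / N            # trace(a^3) / N
m3_from_degrees = np.mean(k**2) - np.mean(k)**2 - np.mean(k)
m3_theory = alpha * np.mean(q.astype(float)**3)
print(m3_measured, m3_from_degrees, m3_theory)
```

At this system size the three numbers agree only up to finite-size fluctuations of several percent, so a single realisation should be read as a plausibility check rather than a precise verification.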


3.5. Link between a and c graph definitions 

In conventional experimental PIN databases one records only whether or not protein pairs 
interact, not the number of complexes in which they interact. Hence, protein interactions are 
normally represented in terms of the adjacency matrix $a = \{a_{ij}\}$, which is related to the weighted 
matrix $c = \{c_{ij}\}$ via $a_{ij} = \theta(c_{ij})\ \forall (i,j)$, with the convention for the step function $\theta(0) = 0$. We 
therefore have $p(a_{ij}) = \langle \delta_{c_{ij},0}\rangle\, \delta_{a_{ij},0} + (1 - \langle \delta_{c_{ij},0}\rangle)\, \delta_{a_{ij},1}$. However, the links $\{a_{ij}\}$ are correlated. 
In Appendix C we derive the relation between the expected values of different graph observables 
for the two graph ensembles $p(a)$ and $p(c)$. Denoting averages in the $a$ ensemble as $\langle \ldots \rangle_a$, and 
using the usual notation $\langle \ldots \rangle$ for averages in the $c$ ensemble, one finds that for large $N$ the first 
two moments of the degree distributions and the first two loop densities in the two ensembles are 
identical: 

$$\langle k\rangle_a = \frac{1}{N} \sum_{i \neq j} \big[ 1 - \langle \delta_{c_{ij},0}\rangle \big] = \alpha\langle q^2\rangle + O(N^{-1}) = \langle k\rangle + O(N^{-1}) \tag{42}$$

$$\langle k^2\rangle_a = \frac{1}{N} \sum_{i \neq j \neq k} \langle a_{ij} a_{jk}\rangle = \alpha\langle q^2\rangle + \alpha\langle q^3\rangle + \alpha^2 \langle q^2\rangle^2 + O(N^{-1}) = \langle k^2\rangle + O(N^{-1}) \tag{43}$$

$$m_3^a = \frac{1}{N} \sum_{[ijk]} \langle a_{ij} a_{jk} a_{ki}\rangle = \alpha\langle q^3\rangle + O(N^{-1}) = m_3 + O(N^{-1}) \tag{44}$$

$$m_4^a = \frac{1}{N} \sum_{[ijk\ell]} \langle a_{ij} a_{jk} a_{k\ell} a_{\ell i}\rangle = \alpha\langle q^4\rangle + O(N^{-1}) = m_4 + O(N^{-1}) \tag{45}$$

Square brackets underneath summations again indicate distinct indices, which excludes 
backtracking in the counting of length-4 loops. Apparently, the ensembles $p(a)$ and $p(c)$ are 
asymptotically equivalent with regard to the statistics of these four quantities. We will see in the 
next section that this equivalence holds also for the ‘dual’ ensemble (3). To test the above claims 
we compute and show in Figure 3 the above observables in synthetic graphs $c$ and $a$ generated 
randomly from (10) and (11), where the random bipartite interaction graph $\xi$ is drawn from (1). 


4. Network properties generated by the d-ensemble 


In this section we will derive properties for the network ensembles (10) and (11) upon assuming that the 
statistics of the underlying bipartite protein interaction network are given by (3), i.e. are 
protein-driven as opposed to complex-driven. In spite of the superficial similarity between definitions (1) 
and (3), the expectations of graph observables in the two ensembles are found to be remarkably 
different. 


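4.1. Link probabilities 

We start by calculating the link expectation values in the weighted graphs $c$: 

$$\langle c_{ij}\rangle = \sum_{\mu=1}^{\alpha N} \langle \xi_i^\mu \xi_j^\mu \rangle = \frac{d_i d_j}{\alpha N} \tag{46}$$

Hence the random graphs $c$ are again finitely connected, now with 

$$\langle k\rangle = \frac{1}{N} \sum_{ij} \langle c_{ij}\rangle = \frac{\langle d\rangle^2}{\alpha} + O(N^{-1}) \tag{47}$$

Averages over $d$ refer to the distribution $P(d)$ of protein promiscuities in the bipartite graph $\xi$. 
The result (47) can also be written as $\langle k\rangle = \alpha\langle q\rangle^2$, and is thus notably different from the earlier 
expression $\langle k\rangle = \alpha\langle q^2\rangle$ found in the $q$-ensemble. The link likelihood is calculated in Appendix A 
and shows again that $p(c_{ij} > 1) = O(N^{-2})$. 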


4.2. Densities of short loops 

We can calculate the density of length-3 loops similar to how this was done for the $q$-ensemble 
in the previous section. Again these are given, to order $O(1)$, by the $S_3$ stars in the bipartite 
graph, since the contribution from combinations of $S_2$ stars is as before $O(N^{-1})$. Here we obtain 

$$m_3 = \frac{1}{N} \sum_{[ijk]} \sum_\mu \langle \xi_i^\mu \xi_j^\mu \xi_k^\mu \rangle = \frac{1}{N} \sum_{[ijk]} \sum_\mu \frac{d_i d_j d_k}{(\alpha N)^3} = \frac{\langle d\rangle^3}{\alpha^2} + O(N^{-1}) \tag{48}$$

For loops of arbitrary length $L$ this generalises to 

$$m_L = \langle d\rangle^L / \alpha^{L-1} \tag{49}$$

Interestingly, the densities $m_L$ of short loops and the average connectivity $\langle k\rangle$ depend on $P(d)$ 
only through its first moment. Promiscuity heterogeneity apparently cannot affect the densities 
of short loops. In the present ensemble these densities must therefore be identical to what would 
be found in a randomly wired bipartite graph. This prediction will be confirmed in simulations. 
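
The insensitivity of loop densities to promiscuity heterogeneity can be illustrated numerically. The following sketch is an assumption-laden illustration rather than the simulation reported in the paper: the system size, the two promiscuity distributions and the function name `d_ensemble_m3` are all hypothetical. It compares a homogeneous and a heterogeneous choice of $\{d_i\}$ with the same mean.

```python
import numpy as np

def d_ensemble_m3(d, alpha, rng):
    """Measured density of length-3 loops when P(xi[i, mu] = 1) = d_i / (alpha N)."""
    d = np.asarray(d, dtype=float)
    N = d.size
    M = int(alpha * N)                            # number of complexes
    xi = (rng.random((N, M)) < d[:, None] / (alpha * N)).astype(float)
    c = xi @ xi.T
    np.fill_diagonal(c, 0.0)
    a = (c > 0).astype(float)                     # binary PIN
    return np.sum((a @ a) * a) / N                # trace(a^3) / N

rng = np.random.default_rng(3)
N, alpha = 2000, 0.5
d_hom = np.full(N, 1.5)                           # homogeneous promiscuities
d_het = rng.choice([0.5, 1.0, 3.0], size=N)       # heterogeneous, mean close to 1.5
m3_hom = d_ensemble_m3(d_hom, alpha, rng)
m3_het = d_ensemble_m3(d_het, alpha, rng)
m3_theory_het = d_het.mean()**3 / alpha**2
print(m3_hom, m3_het, 1.5**3 / alpha**2)
```

Both measured values should lie near $\langle d\rangle^3/\alpha^2$, up to finite-size fluctuations, even though the second graph has a much broader promiscuity distribution.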
4.3. The degree distribution 

In Appendix B we calculate the asymptotic degree distribution of $c$ for the protein-driven complex 
recruitment model (3), giving 

$$p(k) = \lim_{N\to\infty} \frac{1}{N} \sum_i \big\langle \delta_{k, \sum_j c_{ij}} \big\rangle = \sum_{d \geq 0} P(d) \sum_{\ell \geq 0} \Big( e^{-d} \frac{d^\ell}{\ell!} \Big) \Big( e^{-\ell\langle d\rangle/\alpha} \frac{(\ell\langle d\rangle/\alpha)^k}{k!} \Big) \tag{50}$$

This result is again understood easily: the number of neighbours of a node $i$ is a Poissonian 
variable $\ell$, with average $d$, where $d$ is now drawn from $P(d)$. Each of the $\ell$ first neighbours 
will have a degree which is a Poissonian variable with average $\langle d\rangle/\alpha$, so the number $k$ of second 
neighbours of $i$ in the bipartite graph is a Poisson variable with average $\ell\langle d\rangle/\alpha$. Equation (50) 
shows that a tail in the promiscuity distribution $P(d)$ will induce a tail in the degree distribution 
$p(k)$ of $c$. The link between the two distributions is again most easily expressed via generating 
functions. Upon defining $Q_1(z) = \sum_k p(k) e^{-zk}$ and $Q_4(z) = \sum_d P(d) e^{-zd}$, we obtain from (50): 

$$Q_1(z) = \sum_{d \geq 0} P(d)\, e^{-d} \sum_{\ell \geq 0} \frac{\big( d\, e^{\langle d\rangle (e^{-z} - 1)/\alpha} \big)^\ell}{\ell!} = Q_4\big( 1 - e^{\langle d\rangle (e^{-z} - 1)/\alpha} \big) \tag{51}$$

For $z \simeq 0$ this gives 

$$Q_1(z) \simeq Q_4(z\langle d\rangle/\alpha) \tag{52}$$

Hence, if $p(k)$ decays for large $k$ as $p(k) \sim C k^{-\mu}$ with $2 < \mu < 3$, then via (32) we infer that 

$$Q_4(z\langle d\rangle/\alpha) \simeq 1 - \langle k\rangle z + C\,\Gamma(1 - \mu)\, z^{\mu - 1} \tag{53}$$

Equivalently, 

$$Q_4(x) \simeq 1 - \alpha\langle k\rangle x/\langle d\rangle + C\,\Gamma(1 - \mu)\, (\alpha/\langle d\rangle)^{\mu - 1} x^{\mu - 1} \tag{54}$$

This implies that for large $d$ the promiscuity distribution will be of the form $P(d) \simeq \tilde{C} d^{-\mu}$ where 

$$\tilde{C} = C (\alpha/\langle d\rangle)^{\mu - 1} = C \langle q\rangle^{1 - \mu} \tag{55}$$

Any tail in the promiscuity distribution will produce the same tail in the degree distribution of $c$, 
but with a rescaled amplitude. Fat tails in the degree distribution of protein interaction networks 
can thus arise from equally heterogeneous ‘dating’ interactions between proteins, combined with a 
homogeneous distribution of ‘party’ interactions. Short loops are boosted by broad distributions 
of complex sizes, since large complexes in the bipartite graph induce large cliques in the network $c$. 
The $d$-ensemble (3), which attributes any heterogeneity in $p(k)$ to heterogeneity of protein binding 
promiscuities, generates separable PIN graphs $c$ with the least number of loops. Conversely, 
the $q$-ensemble (1), which attributes all heterogeneity in $p(k)$ to heterogeneity in complex sizes, 
generates separable PIN graphs $c$ with the largest number of loops. 


4.4. Relations that are independent of $P(d)$ and $\alpha$ 

The first two moments of the degree distribution $p(k)$ of the separable PIN networks $c$ are 

$$\langle k\rangle = \sum_k k\, p(k) = \sum_d P(d) \sum_\ell e^{-d} \frac{d^\ell}{\ell!}\, \frac{\ell\langle d\rangle}{\alpha} = \langle d\rangle^2/\alpha \tag{56}$$

$$\langle k^2\rangle = \sum_k k^2 p(k) = \sum_d P(d) \sum_\ell e^{-d} \frac{d^\ell}{\ell!} \Big[ \Big( \frac{\ell\langle d\rangle}{\alpha} \Big)^2 + \frac{\ell\langle d\rangle}{\alpha} \Big] = \langle d\rangle^2/\alpha + \langle d\rangle^3/\alpha^2 + \langle d\rangle^2 \langle d^2\rangle/\alpha^2 \tag{57}$$

Combination of (63), (57) and (48) now yields the relation 

$$\langle d^2\rangle/\alpha = \big( \langle k^2\rangle - \langle k\rangle - m_3 \big)/\langle k\rangle \tag{58}$$

which still involves $\langle d^2\rangle$ and $\alpha$. We can also find an alternative expression for the density of loops 
of length 3 by combining (63) and (48): 

$$m_3 = \langle k\rangle^{3/2}/\sqrt{\alpha} \tag{59}$$

Unfortunately, neither of our two expressions for $m_3$, (58) or (59), is directly useful, because the protein 
promiscuity distribution $P(d)$ and the ratio $\alpha$ are generally unknown. Access to information on 
these quantities via future detection experiments would therefore be extremely welcome in support 
of theoretical modelling of protein interaction datasets. To make progress, we need to derive 
relations for graph observables that are independent of $\alpha$ and $P(d)$. We note that (49) yields 

$$\forall L \geq 3:\quad m_{L+1}/m_L = \langle d\rangle/\alpha \tag{60}$$

This can be rewritten, using (63), as 

$$\forall L \geq 3:\quad m_{L+1}/m_L = \sqrt{\langle k\rangle/\alpha} \tag{61}$$

On the other hand, we know from (59) that $m_3/\langle k\rangle = \sqrt{\langle k\rangle/\alpha}$. Combining the above formulae 
allows us to establish the following relation, that now is completely independent of $P(d)$ and $\alpha$: 

$$m_4 = m_3^2/\langle k\rangle \tag{62}$$

Again we have tested the various formulae in synthetically generated graphs, see Figure 4. 
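
The chain of substitutions in this subsection can be checked with a short worked calculation. The sketch below (plain Python; the function name and the numerical values $\langle d\rangle = 1.5$, $\alpha = 0.5$ are arbitrary assumptions) evaluates the leading-order expressions $\langle k\rangle = \langle d\rangle^2/\alpha$ and $m_L = \langle d\rangle^L/\alpha^{L-1}$, and confirms that they satisfy the two derived relations.

```python
def d_ensemble_predictions(mean_d, alpha):
    """Leading-order <k>, m3 and m4 for the protein-driven ensemble."""
    k_mean = mean_d**2 / alpha
    m3 = mean_d**3 / alpha**2
    m4 = mean_d**4 / alpha**3
    return k_mean, m3, m4

k_mean, m3, m4 = d_ensemble_predictions(mean_d=1.5, alpha=0.5)

# Relation between m3 and <k> still involves alpha ...
m3_from_k = k_mean**1.5 / 0.5**0.5
# ... whereas the ratio below involves only measurable graph observables.
m4_from_graph = m3**2 / k_mean
print(k_mean, m3, m4, m3_from_k, m4_from_graph)
```

The last relation is the useful one in practice, since $\langle k\rangle$, $m_3$ and $m_4$ can all be measured directly on a PIN without knowing $\alpha$ or $P(d)$.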

4.5. Link between a and c graph definitions

As a final step, we check whether the observables $m_3$ and $m_4$ are indeed the same for the two PIN
definitions (10, 11), with the bipartite graph of our protein-driven ensemble (3), since protein
detection experiments provide the binary matrix a as opposed to the weighted graph c for which
(66) was derived. Again we denote averages relating to a as $\langle\ldots\rangle_a$, and those relating to c as
$\langle\ldots\rangle$. For the moments of the degree distributions we find the differences to be negligible:

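$$ \langle k\rangle_a = \frac{1}{N}\sum_{ij}\langle a_{ij}\rangle = \frac{\langle d\rangle^2}{\alpha} + O(N^{-1}) = \langle k\rangle + O(N^{-1}) \tag{63} $$

$$ \langle k^2\rangle_a = \frac{1}{N}\sum_{i\neq j\neq k}\langle a_{ij}a_{jk}\rangle = \frac{\langle d\rangle^2}{\alpha} + \frac{\langle d\rangle^3}{\alpha^2} + \frac{\langle d^2\rangle\langle d\rangle^2}{\alpha^2} + O(N^{-1}) = \langle k^2\rangle + O(N^{-1}) \tag{64} $$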

The same is true for the densities of loops of length 3 and 4: 

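$$ \langle m_3\rangle_a = \frac{1}{N}\sum_{[ijk]}\langle a_{ij}a_{jk}a_{ki}\rangle = \frac{\langle d\rangle^3}{\alpha^2} + O(N^{-1}) = m_3 + O(N^{-1}) \tag{65} $$

$$ \langle m_4\rangle_a = \frac{1}{N}\sum_{[ijkl]}\langle a_{ij}a_{jk}a_{kl}a_{li}\rangle = \frac{\langle d\rangle^4}{\alpha^3} + O(N^{-1}) = m_4 + O(N^{-1}) \tag{66} $$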






























This equivalence between the ensembles p(a) and p(c) when calculating the main average values
of graph observables for large N implies that large protein interaction adjacency matrices can in
practice be regarded as having a separable structure. Again, we check our relations (63), (57), (65)
and (66) against synthetically generated graphs and show results in Figure 5.
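
The comparison between the weighted graph c and its binary version a can be illustrated with a small
simulation. The sketch below assumes that in the d-ensemble each bipartite link $\xi_i^\mu$ is switched on
independently with probability $d_i/(\alpha N)$, which is our reading of ensemble (3); the function names,
the system size and the homogeneous choice of promiscuities are hypothetical.

```python
import numpy as np

def sample_d_ensemble(d, alpha, rng):
    """Draw a bipartite graph xi with N = len(d) proteins and alpha*N complexes.

    Assumption: xi[i, mu] = 1 independently with probability d[i] / (alpha * N).
    """
    d = np.asarray(d, dtype=float)
    n = d.size
    n_complexes = int(round(alpha * n))
    p = np.clip(d / n_complexes, 0.0, 1.0)          # link probability of each protein
    return (rng.random((n, n_complexes)) < p[:, None]).astype(float)

def weighted_and_binary(xi):
    """Return c = xi xi^T with zero diagonal, and its binary version a = theta(c)."""
    c = xi @ xi.T
    np.fill_diagonal(c, 0.0)                        # no self-interactions
    a = (c > 0).astype(float)
    return c, a

rng = np.random.default_rng(0)
n, alpha = 1000, 0.5
d = np.full(n, 2.0)                                 # homogeneous promiscuities (illustrative)
xi = sample_d_ensemble(d, alpha, rng)
c, a = weighted_and_binary(xi)
k_c = c.sum() / n                                   # average degree measured in c
k_a = a.sum() / n                                   # average degree measured in a
k_theory = d.mean() ** 2 / alpha                    # leading order <d>^2 / alpha
print(k_c, k_a, k_theory)
```

By construction the degree measured in a can never exceed the one measured in c; for sparse graphs
of this size the two should differ only by a small amount.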


5. Macroscopic observables in the mixed ensemble 

The two bipartite graph ensembles (1, 3) considered so far led to Poissonian distributions either
for the protein promiscuities $d_i$ (in the q-ensemble), or for the complex sizes $q_\mu$ (in the d-ensemble).
It is possible to model heterogeneity in both $d_i$ and $q_\mu$ using the mixed ensemble (5). Due to the
similarities with previous calculations we will be more brief in this section. For ensemble
(5) the expectation values of individual links in the weighted graph c are

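$$ \langle c_{ij}\rangle = \sum_\mu \langle \xi_i^\mu \xi_j^\mu\rangle = \sum_\mu \frac{d_i d_j q_\mu^2}{\alpha^2\langle q\rangle^2 N^2} = \frac{d_i d_j \langle q^2\rangle}{\alpha\langle q\rangle^2 N} + O(N^{-3/2}) \tag{67} $$

and the average connectivity follows as

$$ \langle k\rangle = \frac{1}{N}\sum_{ij}\langle c_{ij}\rangle = \frac{\langle d\rangle^2\langle q^2\rangle}{\alpha\langle q\rangle^2} + O(N^{-1/2}) = \alpha\langle q^2\rangle + O(N^{-1/2}) \tag{68} $$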


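Full details are found in Appendix A. As in previous ensembles, the leading contribution to the
density of length-3 loops comes from the $S_3$ stars in the bipartite graphs, now giving

$$ m_3 = \frac{1}{N}\sum_{[ijk]}\sum_\mu \langle \xi_i^\mu\xi_j^\mu\xi_k^\mu\rangle = \frac{1}{N}\sum_{[ijk]}\sum_\mu \frac{d_i d_j d_k q_\mu^3}{\alpha^3\langle q\rangle^3 N^3} = \frac{\langle d\rangle^3\langle q^3\rangle}{\alpha^2\langle q\rangle^3} = \alpha\langle q^3\rangle \tag{69} $$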


As before, the heterogeneity in the $d_i$ affects neither the average connectivity $\langle k\rangle$ nor the density
of triangles $m_3$; both are as they were in the q-ensemble. This is confirmed numerically, see Figure
6. The degree distribution for large N in the ensemble p(c) is calculated in Appendix B, giving

$$ p(k) = \int_0^\infty dy\, P(y)\, e^{-y} y^k/k! \tag{70} $$

where

$$ P(y) = \sum_d P(d) \sum_{\ell\geq 0} e^{-d}\frac{d^\ell}{\ell!} \sum_{q_1\ldots q_\ell>0} W(q_1)\ldots W(q_\ell)\,\delta\Big[y - \sum_{r\leq\ell} q_r\Big] \tag{71} $$
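
As we read them, relations (70) and (71) describe $p(k)$ as a Poisson mixture: y is a sum of $\ell$
complex sizes drawn from $W(q)$, with $\ell$ itself Poissonian with mean d, and d drawn from $P(d)$. A minimal
Monte Carlo sketch of this construction follows. It assumes discrete $P(d)$ and $W(q)$ supplied as
value and probability arrays; the function names and the example distributions are hypothetical.

```python
import numpy as np

def sample_y(d_values, d_probs, q_values, w_probs, n_samples, rng):
    """Sample y = q_1 + ... + q_l with l ~ Poisson(d), d ~ P(d) and q_r ~ W(q)."""
    d = rng.choice(d_values, size=n_samples, p=d_probs)
    ell = rng.poisson(d)
    return np.array(
        [rng.choice(q_values, size=l, p=w_probs).sum() if l > 0 else 0.0 for l in ell],
        dtype=float,
    )

def poisson_mixture_pmf(y, k_max):
    """Estimate p(k) for k = 0..k_max as the sample average of exp(-y) y^k / k!."""
    y = np.asarray(y, dtype=float)
    pmf = np.empty((y.size, k_max + 1))
    pmf[:, 0] = np.exp(-y)
    for k in range(1, k_max + 1):
        pmf[:, k] = pmf[:, k - 1] * y / k      # stable recursion for the Poisson weights
    return pmf.mean(axis=0)

rng = np.random.default_rng(1)
y = sample_y([1, 2, 3], [0.5, 0.3, 0.2], [2, 3, 4], [0.2, 0.5, 0.3], 5000, rng)
p_k = poisson_mixture_pmf(y, k_max=100)
```

The recursion avoids evaluating large powers and factorials explicitly, and handles samples with
y = 0 (proteins that end up in no complex) without special cases.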


Again it is possible to relate the asymptotic behaviour of $p(k)$ to that of $P(d)$ and $W(q)$, by
inspecting the relation between the relevant generating functions. Using our previous definitions
for $Q_1(z)$, $Q_2(z)$, $Q_3(z)$ and $Q_4(z)$, we obtain via (70) and (71):

$$ Q_1(z) = \int dy\, P(y) \sum_k e^{-y}(y e^{-z})^k/k! = \int dy\, P(y)\, e^{-y(1-e^{-z})} = Q_2(1-e^{-z}) \tag{72} $$

$$ Q_2(z) = \sum_d P(d)\, e^{-d} \sum_\ell \frac{d^\ell}{\ell!} \prod_{r=1}^{\ell} \Big(\sum_{q_r} W(q_r)\, e^{-z q_r}\Big) = \sum_d P(d)\, e^{-d} \sum_\ell \frac{d^\ell}{\ell!}\, Q_3^\ell(z) = \sum_d P(d)\, e^{-d[1-Q_3(z)]} = Q_4(1 - Q_3(z)) \tag{73} $$

Expanding (72) for small z tells us that $Q_1(z) \simeq Q_2(z)$. Substitution into (73) subsequently gives

$$ Q_1(z) \simeq Q_4(1 - Q_3(z)) \tag{74} $$


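Assuming $W(q)$ to have a power-law tail, but with a finite first moment (as in all cases previously
considered), i.e. $W(q) \simeq K q^{-\gamma}$ with $\gamma > 2$, its generating function $Q_3(z)$ can be written as

$$ Q_3(z) = 1 - \langle q^2\rangle z/\langle q\rangle + O(z^\delta) \tag{75} $$

where $\delta = \min\{2, \gamma - 1\}$. Insertion into (74) then leads to

$$ Q_1(z) \simeq Q_4\big(z\langle q^2\rangle/\langle q\rangle - O(z^\delta)\big) \tag{76} $$

If $p(k) = C k^{-\mu}$ with $2 < \mu < 3$, we may use our earlier result (32) and get

$$ Q_4\big(x - O((x\langle q\rangle/\langle q^2\rangle)^\delta)\big) \simeq 1 - \langle k\rangle\langle q\rangle x/\langle q^2\rangle + C\,\Gamma(1-\mu)\,(\langle q\rangle/\langle q^2\rangle)^{\mu-1} x^{\mu-1} \tag{77} $$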


If $\gamma > \mu$ we have $\delta > \mu - 1$, so we can neglect the second term in the argument of $Q_4$
and conclude that the promiscuity distribution has the asymptotic form $P(d) \simeq C' d^{-\mu}$, where
$C' = C(\langle q^2\rangle/\langle q\rangle)^{1-\mu}$. This means that if $W(q)$ decays faster than $p(k)$ (as in Section 4), then
the tail in $p(k)$ must arise from the tail in $P(d)$. Note, however, that heterogeneities in $P(q)$ will
affect the amplitude of the power law tail in $P(d)$, which will be smaller by a factor $(\langle q^2\rangle/\langle q\rangle^2)^{1-\mu}$
compared to the case where $P(q) = \delta_{q,\langle q\rangle}$, where we had $C' = C\langle q\rangle^{1-\mu}$. Conversely, if $\gamma = \mu$ we
have $\delta = \mu - 1$, and writing the $O(z^\delta)$ term explicitly in (76) gives

$$ Q_4\big(z\langle q^2\rangle/\langle q\rangle - K\,\Gamma(1-\mu)\,z^{\mu-1}\big) = 1 - \langle k\rangle z + C\,\Gamma(1-\mu)\,z^{\mu-1} \tag{78} $$

Expanding both sides in powers of z and equating prefactors tells us that either $C' = 0$
and $C = K\langle d\rangle$ (i.e. $K = C/\alpha\langle q\rangle$, which retrieves the case in Section 3), or $P(d)$ has a tail with the
same exponent $\mu$, with $K\langle d\rangle + C'(\langle q^2\rangle/\langle q\rangle)^{\mu-1} = C$. Hence, if $P(d)$ is as broad as $W(q)$, then both
contribute to the tail in $p(k)$, whose amplitude will be the sum of the amplitudes of the tails in
$P(q)$ and $P(d)$.

We see in (77) that $\gamma < \mu$ is not possible, i.e. $W(q)$ needs to decay at least as fast as $p(k)$.
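
The amplitude relation for the case where $W(q)$ decays faster than $p(k)$ lends itself to a quick
numerical illustration. The helper below is a minimal sketch, with a hypothetical name and arguments,
that evaluates $C' = C(\langle q^2\rangle/\langle q\rangle)^{1-\mu}$ as we read it from the text, and compares homogeneous with
heterogeneous complex sizes at equal $\langle q\rangle$. The numerical values are arbitrary.

```python
def promiscuity_tail_amplitude(c_amp, mu, q_mean, q2_mean):
    """Amplitude C' of P(d) ~ C' d^(-mu), given p(k) ~ C k^(-mu) with 2 < mu < 3.

    Uses C' = C (<q^2>/<q>)^(1 - mu), valid when W(q) decays faster than p(k).
    """
    if not 2.0 < mu < 3.0:
        raise ValueError("expected 2 < mu < 3")
    return c_amp * (q2_mean / q_mean) ** (1.0 - mu)

homogeneous = promiscuity_tail_amplitude(1.0, 2.5, 4.0, 16.0)    # <q^2> = <q>^2
heterogeneous = promiscuity_tail_amplitude(1.0, 2.5, 4.0, 32.0)  # broader complex sizes
reduction = heterogeneous / homogeneous                           # (<q^2>/<q>^2)^(1 - mu)
```

Since $1 - \mu < 0$, any broadening of the complex size distribution at fixed $\langle q\rangle$ lowers the amplitude
that must be attributed to the promiscuity tail.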


In Appendix B we calculate the first two moments of the degree distribution $p(k)$ of the
ensemble p(c). This recovers (68) for the first moment, and for the second moment gives

$$ \langle k^2\rangle = \alpha\langle q^2\rangle + \alpha\langle q^3\rangle + \langle d^2\rangle\langle k\rangle^2/\langle d\rangle^2 \tag{79} $$

Substituting (68) and (69) into (79) then leads to

$$ m_3 = \langle k^2\rangle - \langle k\rangle - \langle k\rangle^2\langle d^2\rangle/\langle d\rangle^2 \tag{80} $$
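
The chain (68), (69), (79), (80) can be checked numerically on any pair of promiscuity and complex
size sequences. The sketch below is a minimal illustration under our reading of these formulae
($\langle k\rangle = \alpha\langle q^2\rangle$ and $m_3 = \alpha\langle q^3\rangle$ to leading order); the function names and the small example sequences
are hypothetical, chosen so that the total number of links is the same on both sides of the
bipartite graph.

```python
import numpy as np

def mixed_ensemble_moments(d, q):
    """Leading-order <k>, <k^2> and m3 of the mixed ensemble, via (68), (79) and (69)."""
    d = np.asarray(d, dtype=float)
    q = np.asarray(q, dtype=float)
    alpha = q.size / d.size                                  # complexes per protein
    k1 = alpha * np.mean(q**2)                               # (68)
    m3 = alpha * np.mean(q**3)                               # (69)
    k2 = k1 + m3 + np.mean(d**2) * k1**2 / np.mean(d)**2     # (79)
    return k1, k2, m3

def m3_from_degrees(k1, k2, d_mean, d2_mean):
    """Relation (80): m3 = <k^2> - <k> - <k>^2 <d^2> / <d>^2."""
    return k2 - k1 - k1**2 * d2_mean / d_mean**2

d = [1, 1, 2, 2, 3, 3]          # six proteins, 12 links in total
q = [3, 4, 5]                   # three complexes, 12 links in total
k1, k2, m3 = mixed_ensemble_moments(d, q)
m3_check = m3_from_degrees(k1, k2, np.mean(d), np.mean(np.square(d)))
```

The second function makes explicit which inputs relation (80) needs: besides the measurable
degree moments, it requires the first two moments of the promiscuities.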


The density of length-3 loops depends again on the first two moments of the degree distribution
$p(k)$, but is also seen to depend on the first two moments of the promiscuity distribution $P(d)$,
which is unknown. Hence, this relation cannot serve as a test of PIN data quality. It is nevertheless
useful for comparing the mixed ensemble to the d- and the q-ensembles in synthetically generated
data.

Here we compare the ability of our bipartite ensembles (1, 3, 5) to predict properties of the
associated binary PIN graphs, for synthetic networks that are generated from any of these
ensembles. We focus on comparing homologous formulae for the observables $\langle k\rangle$, $\langle k^2\rangle$, $m_3$ and $m_4$.
The synthetic matrices $a = \{a_{ij}\}$ with $a_{ij} \in \{0,1\}$ are defined as before via $a_{ij} = \theta(\sum_\mu \xi_i^\mu \xi_j^\mu)$ with
$\theta(0) = 0$, and the links of the bipartite graph $\xi$ are generated from the following three protocols.
In the first protocol, links between nodes $(i, \mu)$ are drawn randomly and independently, until their
total number reaches a prescribed limit. In the second protocol, we assign the links preferentially
to complexes with large sizes. In the third protocol, we assign links preferentially to proteins with
large promiscuities.
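
The three wiring protocols can be sketched as follows. The text does not specify the precise
attachment kernels, so the choice made here (selection probability proportional to the current
complex size, or protein promiscuity, plus one) is purely an assumption for illustration, as are
the function name and the reduced system size. The values $\alpha = 0.5$ and $\langle q\rangle = 4.8$ are those quoted
for Figure 8.

```python
import numpy as np

def wire_bipartite(n_proteins, n_complexes, n_links, mode, rng):
    """Place n_links distinct links (i, mu) in a bipartite graph xi.

    mode = "random": protein and complex both chosen uniformly;
    mode = "q":      complex chosen with probability ~ (current size + 1);
    mode = "d":      protein chosen with probability ~ (current promiscuity + 1).
    The '+ 1' kernels are an illustrative assumption.
    """
    if mode not in ("random", "q", "d"):
        raise ValueError("mode must be 'random', 'q' or 'd'")
    if n_links > n_proteins * n_complexes:
        raise ValueError("more links requested than available pairs")
    xi = np.zeros((n_proteins, n_complexes), dtype=np.int8)
    d = np.zeros(n_proteins)          # running promiscuities
    q = np.zeros(n_complexes)         # running complex sizes
    placed = 0
    while placed < n_links:
        if mode == "d":
            i = rng.choice(n_proteins, p=(d + 1.0) / (d + 1.0).sum())
        else:
            i = rng.integers(n_proteins)
        if mode == "q":
            mu = rng.choice(n_complexes, p=(q + 1.0) / (q + 1.0).sum())
        else:
            mu = rng.integers(n_complexes)
        if xi[i, mu] == 0:            # skip pairs that are already linked
            xi[i, mu] = 1
            d[i] += 1
            q[mu] += 1
            placed += 1
    return xi

rng = np.random.default_rng(2)
n, alpha, q_mean = 200, 0.5, 4.8
n_links = int(round(n * alpha * q_mean))            # L = N * alpha * <q>
graphs = {mode: wire_bipartite(n, int(alpha * n), n_links, mode, rng)
          for mode in ("random", "q", "d")}
```

All three graphs have the same number of nodes and links, so that any difference in their degree
distributions and loop densities is due to the wiring protocol alone.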

In Figure 6 we show along the vertical axes the values of $\langle k\rangle$ (left) predicted by the three
ensembles, via formulae (37), (47) and (68), the predicted values of $\langle k^2\rangle$ (middle), via (38), (57),
and (79), and the predicted triangle density $m_3$ (right), via (40), (58) and (80). All are shown
together with the corresponding values that were measured in a, along the horizontal axis. As
expected, the d-ensemble outperforms the other ensembles when links are drawn according to
d-preferential attachment, whereas the q-ensemble performs better for graphs generated via
q-preferential attachment. The mixed ensemble performs very similarly to the q-ensemble in terms
of counting triangles, as expected from the reasoning in Section 5. Deviations between the q-
and the mixed ensembles are most evident in the second moment of the degree distribution,
where the mixed ensemble always leads to values well above those of the q- and the d-ensembles.
We found in Section 4 that the d-ensemble is indistinguishable from a fully random ensemble
when calculating $\langle k\rangle$ and $m_3$, which explains why the d-ensemble predicts the values of these two
observables perfectly. The other two ensembles are more sensitive to finite size effects, as any
heterogeneity in the $q_\mu$ will boost the number of loops.

In Figure 7 we show the values of $m_3$ and $m_4$ predicted by those formulae that involve
only measurable graph observables, for the synthetically generated graphs used in Figure 6. The
prediction of $m_3$ is now obtained from (40) and (66), for the q- and d-ensembles respectively, and
$m_4$ is evaluated using (41) and (66). In Figure 8 we plot the degree distribution $p(k)$ of graphs
with identical values for the number of nodes ($N = 3000$) and the number of links $L = N\alpha\langle q\rangle$,
generated synthetically via the three chosen protocols, together with the distributions $P(q)$ of
complex sizes and $P(d)$ of protein promiscuities. As explained in Section 5, tails in the degree
distribution $p(k) \sim k^{-\mu}$ can arise either from a complex size distribution $P(q) \sim q^{-\mu-1}$ and
a homogeneous promiscuity distribution, or from having an equally fat tail in the promiscuity
distribution $P(d) \sim d^{-\mu}$ together with less heterogeneous complex sizes $P(q) \sim q^{-\gamma-1}$ with $\gamma > \mu$.


7. Test against experimental protein interaction data 

In this section we apply the results of our analyses to real publicly available protein interaction
datasets, obtained via MS (mass spectrometry) and Y2H (yeast 2-hybrid) experiments. The
detailed quantitative features of the various data sets and their references are listed in Table 1.

Species              $N$     $\langle k\rangle$   $k_{\max}$   Method   Reference
C. elegans           2528    2.96    99      Y2H      [?]
C. jejuni            1324    17.5    207     Y2H      [?]
E. coli              2457    7.05    641     MS       [?]
H. pylori            724     3.87    55      Y2H      [?]
H. sapiens I         1499    3.37    125     Y2H      [?]
H. sapiens II        1655    3.71    95      Y2H      [?]
H. sapiens III       2268    5.67    314     MS       [?]
M. loti              1803    3.43    401     Y2H      [?]
P. falciparum        1267    4.17    51      Y2H      [?]
S. cerevisiae I      991     1.82    24      Y2H      [?]
S. cerevisiae II     787     1.91    55      Y2H      [?]
S. cerevisiae III    3241    2.69    279     Y2H      [?]
S. cerevisiae IV     1576    4.58    62      MS       [?]
S. cerevisiae VI     1358    4.73    53      MS       [?]
S. cerevisiae VIII   2551    16.77   955     MS       [?]
S. cerevisiae IX     2708    5.25    141     MS       [?]
Synechocystis        1903    3.25    51      Y2H      [?]
T. pallidum          724     10.01   285     Y2H      [?]

Table 1. List of the publicly available experimental protein interaction data sets as used in
the present study, together with their main quantitative characteristics (number of proteins $N$,
average degree $\langle k\rangle$, and largest degree $k_{\max}$) and references.
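
The characteristics listed in Table 1 can be computed directly from an interaction list. The
following is a minimal sketch, assuming a dataset is available as a list of undirected protein
pairs; the input format, the function name and the decision to count only proteins with at least
one partner are assumptions, and the actual datasets may require additional parsing.

```python
from collections import defaultdict

def pin_summary(edges):
    """Return (N, <k>, k_max) for an undirected interaction list.

    Self-interactions and duplicate pairs are ignored, so the result corresponds
    to a binary adjacency matrix with zero diagonal. N counts only proteins that
    have at least one interaction partner (an assumption of this sketch).
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue                      # drop self-interactions
        neighbours[u].add(v)
        neighbours[v].add(u)
    degrees = [len(partners) for partners in neighbours.values()]
    n = len(degrees)
    if n == 0:
        return 0, 0.0, 0
    return n, sum(degrees) / n, max(degrees)

# Hypothetical toy interaction list with a duplicate pair and a self-interaction.
toy_edges = [("A", "B"), ("B", "C"), ("B", "A"), ("C", "C"), ("B", "D")]
summary = pin_summary(toy_edges)
```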


7.1. Mass spectrometry datasets 

Seven of the experimental PIN datasets in Table 1 were obtained by MS experiments, and they
involved three distinct biological species, namely S. cerevisiae, H. sapiens and E. coli. Each set
takes the form of an $N \times N$ matrix of binary entries $a_{ij}$, but with different values of $N$.

In Figure 9 we show the results of our analytical predictions for the densities of length-3
and length-4 loops, as given by the formulae for the bipartite q- and d-ensembles, versus their
measured values in the MS datasets. The q-ensemble leads to values of the number of short loops
consistently higher than those predicted by the d-ensemble. This could have been expected, since
the q-ensemble induces large cliques in the protein interaction networks c and a, which boosts
short loops. In contrast, the d-ensemble induces a homogeneous distribution for the complex
sizes, and thereby suppresses the presence of large cliques in the protein interaction networks.

Remarkably, the values for length-4 loop densities of all the MS data sets are in between
those of the d-ensemble (which thereby acts as a lower bound) and those of the q-ensemble
(which acts as an upper bound). This suggests a compatibility of data from MS experiments with
the expected separable form of the proteome network. However, the measured length-3 densities
are consistently lower than the values compatible with a separable structure of the proteome.

7.2. Yeast 2-hybrid datasets

We tested similarly the compatibility of Y2H data with a separable structure of the proteome, by
checking whether the measured values for the network observables $m_3$ and $m_4$ fall within what
appeared to be (in MS data) theoretical bounds set by the q- and d-ensembles. We now used
the 12 PIN datasets in Table 1 that were obtained from Y2H experiments. Results are shown in
Figure 10. We observe that Y2H datasets generally exhibit fewer short loops than MS datasets.
This may be due to the fact that Y2H experiments mostly detect direct binding domain contacts
in protein interactions, leading to an undersampling of links (and thereby to an underestimation
of connectivity and loops). However, Y2H data sets still show the same level of compatibility
with a separable structure of the proteome as the MS datasets did, with measured values of $m_4$
that are fully compatible, and values for $m_3$ that fall below those predicted by the d-ensemble.
This is quite remarkable, since MS and Y2H experiments are known to measure interactions in
very different ways.

8. Conclusions 

In this paper we propose a bipartite network representation of protein interactions, where the two 
node types represent proteins and complexes, respectively. A protein-protein interaction network 
can then be regarded as the result of a ‘marginalization’ of the bipartite network, whereby the 
complexes are integrated out (i.e. summed over). This leads to a weighted protein interaction 
network c with a separable structure. Adjacency matrices of protein interaction networks a are
then simply the binary versions of the separable c, obtained by the entry truncations $a_{ij} = \theta(c_{ij})$,
with the convention $\theta(0) = 0$. One of the central results of this work is that for sufficiently
large networks there is an equivalence between the two graph ensembles p(c) and p(a), inasmuch 
as macroscopic statistical properties are concerned, such as densities of short loops and degree 
distributions. This allows us to regard the conventional protein interaction adjacency matrices 
as if they were to have a separable structure, and induces precise relations between expectation 
values of macroscopic graph observables which, remarkably, only depend on measurable quantities 
and on the underlying mechanism with which proteins and complexes recruit each other. They 
are independent of inaccessible microscopic details of proteins and their complexes. 

We considered the two extreme complex recruitment scenarios, one where recruitment is
driven solely by protein promiscuities, and one where it is driven by complex sizes.
Preferential attachment to large complexes (the q-ensemble) favours the presence of large cliques
in PINs, which boosts the number of short loops. Hence we can reasonably expect that the
predictions on short loop densities from the q-ensemble will overestimate the real number of loops.
Conversely, preferential attachment based only on protein promiscuities (the d-ensemble) leads to
homogeneous complex sizes, which suppresses large cliques in PINs, leading to an underestimation
of short loop densities. Remarkably, real protein interaction data from mass spectrometry and
yeast 2-hybrid experiments show a density of length-4 loops in between the predictions of the
d-ensemble and those of the q-ensemble, suggesting a degree of compatibility of these experimental
data with a separable structure of the proteome. In contrast, both MS and Y2H datasets show
densities of length-3 loops that are consistently smaller than all our theoretical predictions.

We believe that, by providing a systematic and practical framework for understanding protein
interaction experiments, our approach may represent a valuable step towards establishing a
more solid connection between protein interaction datasets and the underlying biology. Universal
bounds on observables in PINs may become powerful tools for data quality testing. Improved
versions of the present models, which fit the experimental data better, may open a route to
infer quantities such as the ratio $\alpha$, and the distributions of protein promiscuities and complex
sizes. Such quantities are not available in the current PIN data sets, and are difficult to access
experimentally. The present work has revealed that the asymptotic forms of these distributions
can be extracted from the tails of the PIN degree distributions. Finally, our method may shed
some light on the way proteins and complexes recruit one another, in particular, whether this
recruitment is driven by proteins or by complexes, and may enable us to discriminate between
‘party hub’ and ‘date hub’ interactions.

Appendix A. Link probabilities in the weighted protein interaction network


In this appendix we derive the likelihood to have a link in the weighted protein interaction network
$c_{ij} = \sum_\mu \xi_i^\mu \xi_j^\mu$ when the $\xi_i^\mu$ are drawn from the ensembles (1, 3, 5).

Appendix A.1. The q-ensemble

In the q-ensemble we have

From this one reads off directly the values of $p(c_{ij} = 0)$, $p(c_{ij} = 1)$ and $p(c_{ij} \geq 2)$. The density of
triangles is obtained writing (A.2) as

This gives

Appendix A.2. The d-ensemble

In the d-ensemble we obtain

Appendix A.3. The mixed ensemble

For the mixed ensemble, the link likelihood is found to be

In this appendix we calculate the degree distribution of the weighted protein interaction network
$c_{ij} = \sum_\mu \xi_i^\mu \xi_j^\mu$, in which the entries $\xi_i^\mu$ are drawn from the bipartite ensembles (1, 3, 5), respectively.


Appendix B.1. The q-ensemble

In the q-ensemble, we can calculate $p(k)$ as follows:

Hence, for large network sizes $N \to \infty$ we obtain

We can rewrite this in terms of the distribution $W(q) = qP(q)/\langle q\rangle$, which denotes the likelihood
to draw a link attached to a node of degree q in the bipartite graph,

and upon defining


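$$ P(y) = \sum_{\ell\geq 0} e^{-\alpha\langle q\rangle}\frac{(\alpha\langle q\rangle)^\ell}{\ell!} \sum_{q_1\ldots q_\ell>0} W(q_1)\ldots W(q_\ell)\,\delta\Big[y - \sum_{r\leq\ell} q_r\Big] \tag{B.4} $$

we finally get to

$$ \lim_{N\to\infty} p(k) = \int_0^\infty dy\, P(y)\, e^{-y} y^k/k! \tag{B.5} $$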
We can calculate

Appendix B.3. The mixed ensemble

In the mixed ensemble we have the asymptotic degree distribution

Appendix C. The link between observables in the a and c networks

In this appendix we inspect the relation between expectation values of various observables in the
ensembles p(a) and p(c).


Appendix C.1. The q-ensemble

Denoting averages in the a ensemble as $\langle\ldots\rangle_a$, we have, for the q-ensemble of bipartite graphs:

Finally for loops of length 4, we have

Again, the square brackets underneath the summations indicate that all indices are different, to
exclude backtracking in the counting of loops of length 4.

For the d-ensemble, denoting averages relating to a as $\langle\ldots\rangle_a$, we have:
Figure 3. Symbols: $\langle k\rangle$, $\langle k^2\rangle$, $m_3$ and $m_4$ as measured in synthetic graphs c drawn from (11)
with $N = 3000$, shown versus corresponding values found in the binary graphs a drawn from (10).
Bipartite interaction graphs $\xi$ are drawn from (1), with complex size distributions $P(q)$ that are
Poissonian (left panels) or power law (right panels). Dotted lines: the diagonals (shown as guides
to the eye). As expected, the values measured in the weighted graphs c are consistently higher
than in the binary ones, but one finds that these deviations get smaller for increasing network
sizes $N$.
Figure 4. Symbols: theoretical $\langle\ldots\rangle_{\rm th}$ versus measured $\langle\ldots\rangle_{\rm m}$ values of observables $\langle k\rangle$, $\langle k^2\rangle$,
$m_3$ and $m_4$ in synthetic random graphs c with $N = 3000$, defined via (3) for a power-law
promiscuity distribution $P(d)$. Theoretical values are given by formulae (63) for $\langle k\rangle$,
(57) for $\langle k^2\rangle$, (48), (59) and (66) for $m_3$ and (49) and (66) for $m_4$. Dotted lines: the diagonals
(shown as guides to the eye).
Figure 5. Symbols: $\langle k\rangle$, $\langle k^2\rangle$, $m_3$ and $m_4$ as measured in synthetic graphs c drawn from (11)
with $N = 3000$, shown versus corresponding values found in the binary graphs a drawn from
(10). Bipartite interaction graphs $\xi$ are drawn from (3), with protein promiscuity distributions
$P(d)$ that have a power law form. Dotted lines: the diagonals (shown as guides to the eye). As
expected, the values measured in the weighted graphs c are consistently higher than in the binary
ones, but these deviations get smaller for increasing network sizes $N$.
Figure 6. Symbols: theoretical $\langle\ldots\rangle_{\rm th}$ versus measured $\langle\ldots\rangle_{\rm m}$ values of observables $\langle k\rangle$,
$\langle k^2\rangle$, and $m_3$ in synthetic random graphs a with $N = 3000$ and $\alpha = 0.5$, generated
either via random wiring (top panels), q-preferential attachment (middle panels) or d-preferential
attachment (bottom panels). Dotted lines: the diagonals (shown as guides to the eye).
Figure 7. Predicted versus real $m_3$ (left) and $m_4$ (right) for random bipartite graphs with
$N = 3000$ and $\alpha = 0.5$ generated via random wiring (top panels), q-preferential attachment (middle
panels) and d-preferential attachment (bottom panels), calculated by using formulae (40), (41) and (66),
with the observables appearing in these formulae computed directly from the network.

Figure 8. Distributions $P(q)$ of complex sizes, $P(d)$ of protein promiscuities, and $p(k)$ of the
degrees in a (distinguished by the markers shown in the panel legends), for random bipartite graphs
with $N = 3000$, $\alpha = 0.5$ and $\langle q\rangle = 4.8$, which have been generated either via random wiring (left),
via q-preferential attachment (middle), or via d-preferential attachment (right).
Figure 9. Left: theoretical predictions $m_3^{\rm th}$ for the densities of length-3 loops in the PINs,
as obtained from the q-ensemble (stars) and the d-ensemble (circles), plotted versus the values
$m_3^{\rm m}$ measured in the different MS datasets. Right: theoretical predictions $m_4^{\rm th}$ for the densities
of length-4 loops in the same PINs, obtained from the q-ensemble (stars) and the d-ensemble
(circles), plotted versus the measured values $m_4^{\rm m}$. The diagonals are shown as guides to the eye.

Figure 10. Left: theoretical predictions $m_3^{\rm th}$ for the densities of length-3 loops in the PINs,
as obtained from the q-ensemble (stars) and the d-ensemble (circles), plotted versus the values
$m_3^{\rm m}$ measured in the different Y2H datasets. Right: theoretical predictions $m_4^{\rm th}$ for the densities
of length-4 loops in the same PINs, obtained from the q-ensemble (stars) and the d-ensemble
(circles), plotted versus the measured values $m_4^{\rm m}$. The diagonals are shown as guides to the eye.































